Abstract The k-nearest neighbors (k-NN) classification rule
has proven extremely successful in countless computer vision
applications. For example, image categorization often relies on
uniform voting among the nearest prototypes in the space of
descriptors. In spite of its good generalization properties and
its natural extension to multi-class problems, the classic k-NN
rule suffers from high variance when dealing with sparse
prototype datasets in high dimensions. A few techniques have
been proposed to improve k-NN classification, which rely on
either deforming the nearest neighborhood relationship by
learning a distance function or modifying the input space by
means of subspace selection.

In this paper, we propose a novel boosting algorithm, 
called UNN (Universal Nearest Neighbors), which induces 
leveraged k-NN, thus generalizing the classic k-NN rule. 
Our approach consists in redefining the voting rule as a strong
classifier that linearly combines predictions from the k closest
prototypes. The k nearest neighbor examples thus act as weak
classifiers, and their weights, called leveraging coefficients,
are learned by UNN so as to minimize a surrogate risk, which
upper bounds the empirical misclassification rate over training
data. A major feature of UNN is the ability to learn which
prototypes are the most relevant for a given class, thus
allowing effective data reduction by filtering the training
data.

Experimental results on the synthetic two-class dataset 
of Ripley show that such a filtering strategy is able to reject 
"noisy" prototypes, and yields a classification error close to 
the optimal Bayes error. We also carried out image categorization
experiments on a database containing eight classes of natural
scenes. We show that our method significantly outperforms the
classic k-NN classification, while enabling a significant
reduction of the computational cost by means of data filtering.

Keywords Boosting • k nearest neighbors • Image 
categorization • Scene classification 



1 Introduction 

1.1 Generic visual categorization

In this paper, we address the problem of generic visual cate- 
gorization. This is a relevant task in computer vision, which 
aims at automatically classifying images into a discrete set 
of categories, such as indoor vs outdoor, beaches vs mountains,
churches vs towers. Generic categorization is distinct from
object and scene recognition, which are classification tasks
concerning particular instances of objects or scenes (e.g.
Notre Dame Cathedral vs St. Peter's Basilica). It is also
distinct from other related computer vision tasks, such as
content-based image retrieval (which aims at finding images in
a database that are semantically related or visually similar to
a given query image) and object detection (which requires
finding both the presence and the position of a target object
in an image, e.g. person detection).

Automatic categorization of generic scenes is still a
challenging task, due to the huge number of natural categories
that should be considered in general. In addition, natural
image categories may exhibit high intra-class variability
(i.e., visually different images may belong to the same
category) and low inter-class variability (i.e., distinct
categories may contain visually similar images).

Classifying images requires a reliable description of the
content relevant for an application (e.g., location and shape
of specific objects or overall scene appearance). Examples of
suitable image descriptors for categorization purposes are
Gist, i.e. global image features representing the overall scene
(Oliva and Torralba 2001), and SIFT descriptors, i.e.
descriptors of local features extracted at salient patches
(Lowe 2004).

The Gist descriptor is based on the so-called "spatial
envelope" (Oliva and Torralba 2001), which is a very effective
low dimensional representation of the overall scene based on
spectral information. Such a representation bypasses
segmentation, extraction of keypoints and processing of
individual objects and regions, thus enabling a compact global
description of images. Gist descriptors have been successfully
used for categorizing locations and environments, showing
their ability to provide relevant priors for more specific
tasks, like object recognition and detection (Rubin et al 2003).

1.2 k-NN classification

Apart from the descriptors used to compactly represent images,
most image categorization methods rely on supervised learning
techniques for exploiting information about known samples when
classifying an unlabeled sample. Among these techniques, k-NN
classification has proven successful, thanks to its easy
implementation and its good generalization properties
(Shakhnarovich et al 2006). Indeed, the k-NN rule does not
require explicit construction of the feature space and is
naturally adapted to multi-class problems. Moreover, from the
theoretical point of view, k-NN classification provably tends
to the Bayes optimal when increasing the sample size. Although
such advantages make k-NN classification very attractive to
practitioners, it is an algorithmic challenge to speed up k-NN
queries and design schemes that scale up well with large
dimensional datasets (Shakhnarovich et al 2006). Moreover, it
is yet another challenge to reduce the misclassification rate
of the k-NN rule, usually tackled by data reduction techniques
(Hart 1968).

In a number of works, the classification problem has been
reduced to tracking ill-defined categories of neighbors,
interpreted as "noisy" (Brighton and Mellish 2002). Most of
these recent techniques are in fact partial solutions to a
larger problem related to nearest neighbors' error, which does
not have to be the discrete prediction of labels, but rather a
continuous estimation of class membership probabilities
(Holmes and Adams 2003). This problem has been reformulated by
Marin et al (2009) as a strong advocacy for the formal
transposition of boosting to nearest neighbors classification.
Such a formalization is challenging as nearest neighbors rules
are indeed not induced, whereas all formal boosting algorithms
induce so-called strong classifiers by combining weak
classifiers (also induced, say by decision stumps).

A survey of the literature shows that at least four different
categories of approaches have been proposed in order to
improve k-NN classification:

- learning a local or global adaptive distance metric;

- embedding data in the feature space (kernel nearest
neighbors);

- distance-weighted and difference-weighted nearest neighbors;

- boosting nearest neighbors.

The earliest approaches to generalizing the k-NN classification
rule relied on learning an adaptive distance metric from
training data. Refer to the seminal work of Fukunaga and Flick
(1984), who presented an optimal global metric for k-NN. An
analogous approach was later adopted by Hastie and Tibshirani
(1996), who carried out linear discriminant analysis to
adaptively deform the distance metric. Recently, Paredes (2006)
has proposed a method for learning a weighted distance, where
weights can be either global (i.e., only depending on classes
and features) or local (i.e., depending on each individual
prototype as well).

Other more recent techniques apply the nearest neighbors rule
to data embedded in a high-dimensional feature space, following
the kernel trick approach of support vector machines. For
example, Yu et al (2002) have proposed a straightforward
adaptation of the kernel mapping to the nearest neighbors rule,
which yields significant improvement in terms of classification
accuracy. In the context of vision, a successful technique has
been proposed by Zhang et al (2006), which involves a
"refinement" step at classification time, without relying on
explicitly learning the distance metric. This method trains a
local support vector machine on nearest neighbors of a given
query, thus limiting the most expensive computations to a
reduced subset of prototypes.

Another class of k-NN methods relies on weighting nearest
neighbors votes based on their distances to the query sample
(Dudani 1976). Recently, Zuo et al (2008) have proposed a
similar weighting approach, where the nearest neighbors are
weighted based on their vector difference to the query. Such a
difference-weight assignment is defined as a constrained
optimization problem of sample reconstruction from its
neighborhood. The same authors have proposed a kernel-based
non-linear version of this algorithm as well.

Finally, only very few works have proposed the use of boosting
techniques for k-NN classification. For instance, Amores et al
(2006) use AdaBoost for learning a distance function to be used
for k-NN search. On the other hand, Garcia-Pedrajas and
Ortiz-Boyer (2009) adopt the boosting approach in a
non-conventional way. At each iteration a different k-NN
classifier is trained over a modified input space. Namely, the
authors propose two variants of the method, depending on the
way the input space is modified. Their first algorithm is based
on optimal subspace selection, i.e., at each boosting iteration
the most relevant subset of input data is computed. The second
algorithm relies on modifying the input space by means of
non-linear projections. But neither method is strictly an
algorithm for inducing weak classifiers from the k-NN rule,
thus not directly addressing the problem of boosting k-NN
classifiers. Moreover, such approaches are computationally
expensive, as they rely on a genetic algorithm and a neural
network, respectively.

Conversely, we propose a complete solution to the problem of
boosting k-NN classifiers in the general multi-class setting.
Namely, we propose a novel boosting algorithm, called UNN,
which induces a leveraged nearest neighbors rule that
generalizes the uniform k-NN rule. Indeed, the voting rule is
redefined as a strong classifier that linearly combines weak
classifiers induced by the k-NN rule. Therefore, our approach
does not need to learn a distance function, as it directly
operates on top of k-nearest neighbors search. At the same
time, it does not require an explicit computation of the
feature space, thus preserving one of the main advantages of
prototype-based methods. Our UNN boosting algorithm is an
iterative procedure that learns the weights of weak
classifiers, called leveraging coefficients. We show that this
algorithm converges to the global minimum of any chosen
classification calibrated surrogate^1 (Bartlett et al 2006).
Hence, our framework handles most popular losses in the machine
learning literature: squared loss, exponential loss, logistic
loss, etc. In particular, we prove a specific convergence rate
for the exponential loss (reported in our experiments) far
better than the general rate of Nock and Nielsen (2009).
Another important characteristic of UNN is that it is able to
discriminate the most relevant prototypes for a given class,
thus allowing significant data reduction while improving
classification performances at the same time.

^1 A surrogate is a function which is a suitable upper bound
for another function (here, the non-convex non-differentiable
empirical risk).

1.3 Overview of the paper

In the following sections we present our approach to k-NN
boosting. Sections 2.1-2.3 present key definitions for k-NN
boosting. These sections also describe how to replace the
classic uniform k-NN rule by a leveraged k-NN rule. Leveraged
k-NN classifiers are induced by the UNN algorithm, which is
detailed in Sec. 2.4 for the case of exponential risk. Sec. 2.5
presents the generic convergence theorem of UNN and the upper
bound performance for the exponential risk minimization. Our
experiments on both synthetic and image categorization datasets
are reported in Sec. 3. Then, Sec. 4 discusses results and
mentions future work.

In order not to burden the body of the paper, the general form
of the UNN algorithm and proof sketches of our theorems have
been postponed to an appendix in Sec. 5.

2 Method

2.1 Problem statement and notation

In this work, we address the task of multi-class, single-label
image categorization. Hence, several categories of images are
predefined, whereas each image is constrained to belong to a
single category. The number of categories (or classes) may
range from a few to hundreds, depending on applications. E.g.,
categorization with 67 indoor categories has been recently
studied by Quattoni and Torralba (2009). We treat the
multi-class problem as multiple binary classification problems,
as is customary in machine learning. I.e., for each class c, a
query image is classified either to c or to $\bar{c}$ (the
complement class of c, which contains all classes but c) with a
certain confidence (classification score). Then the label with
the maximum score is assigned to the query. Images are
represented by descriptors related to given local or global
features. We refer to an image descriptor as an observation
$o \in O$, which is a vector of n features and belongs to a
domain O (e.g., $\mathbb{R}^n$ or $[0, 1]^n$). A label is
associated to each image descriptor according to a predefined
set of C classes. Hence, an observation with the corresponding
label leads to an example, which is the ordered pair
$(o, y) \in O \times \mathbb{R}^C$, where y is termed the class
vector that specifies the class memberships of o. In
particular, the sign of $y_c$ gives the membership of example
(o, y) to class c, such that $y_c$ is negative iff the
observation does not belong to class c, positive otherwise. At
the same time, the absolute value of $y_c$ may be interpreted
as a relative confidence in the membership. Inspired by the
multi-class boosting analysis of Zhu et al (2006), we constrain
class vectors to be symmetric, that is:

$$\sum_{c=1}^{C} y_c = 0 . \qquad (1)$$

Hence, in the single-label framework, the class vector of an
observation o belonging to class $\tilde{c}$ is defined as:
$y_{\tilde{c}} = 1$, $y_{c \neq \tilde{c}} = -\frac{1}{C-1}$.
This setting turns out to be necessary when treating
multi-class classification as multiple binary classifications,
as it balances negative and positive labels of a given example
over all classes. We are given an input set of m examples
$S = \{(o_i, y_i), i = 1, 2, ..., m\}$, arising from annotated
images, which form the training set.
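
As an illustration of the notation above, the following Python sketch builds the symmetric class vector of a single-label example and applies the maximum-score decision. The function names `class_vector` and `predict_label`, and the convention of indexing classes from 0, are assumptions made for this example only.

```python
def class_vector(label, num_classes):
    """Symmetric class vector: 1 for the true class, -1/(C-1) for the others."""
    negative = -1.0 / (num_classes - 1)
    return [1.0 if c == label else negative for c in range(num_classes)]


def predict_label(scores):
    """One-versus-all decision: return the class index with the maximum score."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Classes are indexed from 0 here (an assumption of this sketch).
y = class_vector(label=2, num_classes=4)
print(y, sum(y))                              # the components sum to zero up to rounding
print(predict_label([0.1, -0.3, 0.7, 0.0]))   # index of the largest score
```

With C = 4, the true class receives 1 and each of the three other classes receives -1/3, so that constraint (1) holds.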



2.2 Boosting k-NN for minimization of surrogate risks

We aim at defining a one-versus-all classifier for each cate- 
gory, which is to be trained over the set of examples. This 
classifier is expected to correctly classify as many new ob- 
servations as possible, i.e. to predict their true labels. There- 
fore, we aim at determining a classification rule h from the 
example dataset, which is able to minimize the classification 
error over all possible new observations. But since the un- 
derlying class probability densities are generally unknown 
and difficult to estimate, defining a classifier in the frame- 
work of supervised learning can be viewed as fitting a clas- 
sification rule onto a training set S without overfitting. This 
corresponds to defining a classifier that correctly classifies 
most of the example data themselves, thus minimizing the 
classification error over the example dataset (empirical or 
true classification loss). Therefore, in the most basic
framework of supervised classification, one wishes to train a
classifier on S, i.e. build a function
$h : O \rightarrow \mathbb{R}^C$ with the objective to minimize
its empirical risk on S, defined as:

$$\varepsilon^{0/1}(h, S) = \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \left[ \varrho(h, i, c) < 0 \right] , \qquad (2)$$

with [.] the indicator function (1 iff true, 0 otherwise),
called here the 0/1 loss, and:



$$\varrho(h, i, c) = y_{ic} h_c(o_i) \qquad (3)$$



the edge of classifier h on example $(o_i, y_i)$ for class c.
Taking the sign of $h_c$ in {-1, +1} as its membership
prediction for class c, one sees that when the edge is positive
(resp. negative), the membership predicted by the classifier
and the actual example's membership agree (resp. disagree).
Therefore, (2) averages over all classes the number of
mismatches for the membership predictions, thus measuring the
goodness-of-fit of the classification rule on the training
dataset. Provided that the example dataset has good
generalization properties with respect to the unknown
distribution of possible observations, minimizing this
empirical risk is expected to yield good accuracy when
classifying unlabeled observations. Unfortunately, minimizing
the empirical risk is mathematically not tractable as it deals
with non-convex optimization. In order to bypass this
cumbersome optimization challenge, the current trend of
supervised learning (including boosting and support vector
machines) has replaced the minimization of the empirical risk
(2) by that of a so-called surrogate risk (Bartlett et al
2006), to make the optimization problem amenable. In boosting,
it amounts to summing (or averaging) over classes and examples
a real-valued function called the surrogate loss, thus ending
up with the following rewriting of (2):

$$\varepsilon^{\psi}(h, S) = \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi\left( \varrho(h, i, c) \right) . \qquad (4)$$

Important choices available for $\psi$ include:

$$\psi^{\mathrm{sqr}}(x) = (1 - x)^2 , \qquad (5)$$

$$\psi^{\mathrm{exp}}(x) = \exp(-x) , \qquad (6)$$

$$\psi^{\mathrm{log}}(x) = \log(1 + \exp(-x)) ; \qquad (7)$$

(5) is the squared loss (Bartlett et al 2006), (6) is the
exponential loss (Schapire and Singer 1999), and (7) is the
logistic loss (Bartlett et al 2006).
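
To make the relationship between the empirical risk (2) and the surrogate risk (4) concrete, the following Python sketch evaluates both on a small, made-up table of edges. The table `edges`, indexed as `edges[i][c]`, and all function names are assumptions of this example.

```python
import math


def empirical_risk(edges):
    """0/1 risk (2): fraction of (example, class) pairs with a negative edge."""
    m, C = len(edges), len(edges[0])
    return sum(1 for row in edges for e in row if e < 0) / (m * C)


def surrogate_risk(edges, psi):
    """Surrogate risk (4): average of psi over all (example, class) edges."""
    m, C = len(edges), len(edges[0])
    return sum(psi(e) for row in edges for e in row) / (m * C)


def psi_sqr(x):
    return (1.0 - x) ** 2                    # squared loss (5)


def psi_exp(x):
    return math.exp(-x)                      # exponential loss (6)


def psi_log(x):
    return math.log(1.0 + math.exp(-x))      # logistic loss (7)


# Hypothetical edges y_ic * h_c(o_i) for m = 2 examples and C = 2 classes.
edges = [[0.5, -0.2], [1.0, 0.3]]
risk_01 = empirical_risk(edges)
risk_exp = surrogate_risk(edges, psi_exp)
risk_sqr = surrogate_risk(edges, psi_sqr)
print(risk_01, risk_exp, risk_sqr, surrogate_risk(edges, psi_log))
```

On this toy table one edge out of four is negative, and both the exponential and the squared risks lie above the resulting 0/1 risk.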

Surrogates play a fundamental role in supervised learning. They
are upper bounds of the empirical risk with desirable convexity
properties. Minimizing them has a remarkable impact on the
empirical risk, which makes it possible to provide minimization
algorithms with good generalization properties (Nock and
Nielsen 2009).

In this paper, we build on recent advances in boosting with
surrogate risks to redefine the k-NN classification rule. In
particular, we concentrate on the exponential risk and provide
a novel algorithm that learns a leveraged k-NN classifier,
while provably converging to the global optimum of a surrogate
risk. Our algorithm, called UNN (Universal Nearest Neighbors),
meets boosting-type convergence properties under two mild
assumptions on the training set: weak learning and weak
coverage properties. In the Appendix, we also describe how the
UNN algorithm generalizes to any surrogate loss, and provide
the most general analysis.

2.3 Leveraged k-NN rule 

In the following, we denote by $\mathrm{NN}_k(o_{i'})$ the set
of the k-nearest neighbors (with integer constant k > 0) of an
example $(o_{i'}, y_{i'})$ in set S with respect to a
non-negative real-valued "distance" function. This function is
defined on domain O and measures how much two observations
differ from each other. This dissimilarity function thus may
not necessarily satisfy the triangle inequality of metrics.
(All experiments in this paper refer to nearest neighbors with
respect to the Euclidean distance.) For sake of readability, we
let $i \sim_k i'$ denote an example $(o_i, y_i)$ that belongs
to $\mathrm{NN}_k(o_{i'})$. This neighborhood relationship is
intrinsically asymmetric, i.e., $i \sim_k i'$ does not
necessarily imply that $i' \sim_k i$. Indeed, a nearest
neighbor of $o_{i'}$ does not necessarily contain $o_{i'}$
among its own nearest neighbors.

The k-nearest neighbors rule (k-NN) is the following
multi-class classifier $h = \{h_c : c = 1, 2, ..., C\}$ (k
appears in the summation indices):

$$h_c(o_{i'}) = \sum_{i \sim_k i'} [y_{ic} > 0] , \qquad (8)$$

where $h_c$ is the one-versus-all classifier for class c and
square brackets denote the indicator function. Hence, the
classic nearest neighbors classification is based on majority
vote among the k closest prototypes.

In this paper, we propose to weight the votes of nearest
neighbors by means of real coefficients, thus generalizing (8)
to the following leveraged k-NN rule
$h^{\ell} = \{h^{\ell}_c : c = 1, 2, ..., C\}$:

$$h^{\ell}_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc} y_{jc} , \qquad (9)$$

where $\alpha_{jc} \in \mathbb{R}$ is the leveraging
coefficient for example j in class c, with j = 1, 2, ..., m and
c = 1, 2, ..., C. Hence, (9) linearly combines class labels of
the k nearest neighbors (defined in Sec. 2.1) with their
leveraging coefficients.
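
The difference between the uniform rule (8) and the leveraged rule (9) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in our experiments: the brute-force Euclidean search, the toy data and the names `k_nearest`, `knn_scores` and `leveraged_scores` are all hypothetical.

```python
import math


def k_nearest(query, observations, k):
    """Indices of the k observations closest to `query` (Euclidean distance)."""
    order = sorted(range(len(observations)),
                   key=lambda i: math.dist(query, observations[i]))
    return order[:k]


def knn_scores(query, observations, Y, k):
    """Uniform rule (8): per class, count the neighbors with positive membership."""
    nn = k_nearest(query, observations, k)
    return [sum(1 for i in nn if Y[i][c] > 0) for c in range(len(Y[0]))]


def leveraged_scores(query, observations, Y, alpha, k):
    """Leveraged rule (9): per class, sum alpha[j][c] * Y[j][c] over the neighbors."""
    nn = k_nearest(query, observations, k)
    return [sum(alpha[j][c] * Y[j][c] for j in nn) for c in range(len(Y[0]))]


# Toy two-class data; with C = 2 the symmetric class vectors are (1, -1) and (-1, 1).
observations = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
Y = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]]
alpha = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # hypothetical coefficients

uniform = knn_scores((0.2, 0.2), observations, Y, k=3)
leveraged = leveraged_scores((0.2, 0.2), observations, Y, alpha, k=3)
print(uniform, leveraged)
```

In this toy case the third prototype has a zero leveraging coefficient, so it no longer contributes to the vote even though it is among the three nearest neighbors of the query.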

The main contribution of our work is to define a general
algorithm (UNN) for learning these leveraging coefficients from
training data. This algorithm operates on top of classic k-NN
methods, for it does not affect the nearest neighbors search
when inducing weak classifiers of (9). Indeed, it is
independent of the way nearest neighbors are computed, unlike
most of the approaches mentioned in Sec. 1.2, which rely on
modifying the neighborhood relationship via metric distance
deformations or kernel transformations. Yet our approach
remains fully compatible with any underlying (metric) distance
and data structure for k-NN search, as well as possible kernel
transformations of the input space.

For a given training set S of m labeled examples, we define the
k-NN edge matrix $R^{(c)} \in \mathbb{R}^{m \times m}$ for each
class c = 1, 2, ..., C (Nock and Nielsen 2009):

$$r^{(c)}_{ij} = \begin{cases} y_{ic} y_{jc} & \text{if } j \sim_k i \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

The name of $R^{(c)}$ is justified by an immediate parallel
with (3). Indeed, each example j serves as a classifier for
each example i, predicting 0 if $j \notin \mathrm{NN}_k(o_i)$,
$y_{jc}$ otherwise, for the membership to class c. Hence, the
j-th column of matrix $R^{(c)}$, $r^{(c)}_j$, which is
different from 0 when choosing k > 0, collects all edges of
"classifier" j for class c. Note that non-zero entries of this
column correspond to the so-called reciprocal nearest neighbors
(Rk-NN) of j, i.e., those examples for which j is a neighbor
(Fig. 1). It finally comes that the edge of the leveraged k-NN
rule on example i for class c is:

$$\varrho(h^{\ell}, i, c) = (R^{(c)} \alpha^{(c)})_i , \quad c = 1, 2, ..., C , \qquad (11)$$

where $\alpha^{(c)}$ collects all leveraging coefficients in a
vector form for class c: $\alpha^{(c)}_i = \alpha_{ic}$,
i = 1, 2, ..., m. The expression of the surrogate risk (4) can
be written as follows after replacing the argument of
$\psi(.)$ in (4) by (11):

$$\varepsilon^{\psi}(h^{\ell}, S) = \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi\left( \sum_{j=1}^{m} r^{(c)}_{ij} \alpha_{jc} \right) . \qquad (12)$$

Therefore, fitting all $\alpha_{jc}$'s so as to minimize the
surrogate loss (12) is the main goal of our learning algorithm
UNN for inducing the leveraged k-NN classifier $h^{\ell}$.

Fig. 1 A toy example of direct (left) and reciprocal (right)
k-nearest neighbors (k = 1) of an example j. Squares and
circles represent examples of positive and negative classes.
Each arrow connects an example to its 1-NN.

2.4 UNN: learning $\alpha_{jc}$ of leveraged k-NN classifier

We propose a novel classification algorithm which induces the
leveraged nearest neighbors classifier (Eq. (9)) in the
multi-class one-versus-all framework. In this section, we
explain UNN specialized for the exponential risk minimization,
with pseudo-code shown in Alg. 1. However, our analysis is much
more general, as it involves the broad class of
classification-calibrated surrogate risks (Bartlett et al
2006), and is postponed to the Appendix in order not to burden
the methodology. Like common boosting algorithms, UNN operates
on a set of weights $w_i$ (i = 1, 2, ..., m) defined over
training data. Such weights are repeatedly updated to fit all
leveraging coefficients $\alpha^{(c)}$ for class c
(c = 1, 2, ..., C). At each iteration, the index to leverage,
$j \in \{1, 2, ..., m\}$, is obtained by a call to a weak index
chooser oracle WIC(., ., .), whose implementation is postponed
to steps [A.1] and [A.2], detailed later on in this section.
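
The edge matrix (10), the reciprocal neighbors read from its columns, and the edge vector (11) can be sketched as follows. This is a dense, brute-force illustration; the helper names are hypothetical, and the choice of excluding an example from its own neighborhood is an assumption of this sketch.

```python
import math


def neighbors(i, observations, k):
    """k nearest neighbors of training example i; i itself is excluded (assumption)."""
    others = [j for j in range(len(observations)) if j != i]
    others.sort(key=lambda j: math.dist(observations[i], observations[j]))
    return others[:k]


def edge_matrix(observations, Y, c, k):
    """Matrix (10): R[i][j] = y_ic * y_jc if j is a neighbor of i, else 0."""
    m = len(observations)
    R = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in neighbors(i, observations, k):
            R[i][j] = Y[i][c] * Y[j][c]
    return R


def reciprocal_neighbors(R, j):
    """Rows with a non-zero entry in column j: the examples that have j as neighbor."""
    return [i for i in range(len(R)) if R[i][j] != 0.0]


def leveraged_edges(R, alpha_c):
    """Edge vector (11): the matrix-vector product R alpha for one class."""
    return [sum(r * a for r, a in zip(row, alpha_c)) for row in R]


# Four points on a line, two per class, with k = 1.
observations = [(0.0,), (1.0,), (1.5,), (5.0,)]
Y = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]]
R = edge_matrix(observations, Y, c=0, k=1)
print(reciprocal_neighbors(R, 1), reciprocal_neighbors(R, 0))
print(leveraged_edges(R, [0.0, 1.0, 2.0, 0.0]))
```

Example 0 has no reciprocal neighbor in this toy set, so its column is zero and its leveraging coefficient has no influence on any training edge.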

The training phase is implemented in a one-versus-all fashion,
i.e. C learning problems are solved independently, and for each
class c the training examples are considered as belonging to
either class c or the complement class $\bar{c}$, i.e. any
other class. Eventually, one leveraging coefficient
($\alpha_{jc}$) per class is learned for each weak classifier
(indexed by j). In the Appendix, we show that Alg. 1 is a
specialization of a very general classification algorithm, thus
justifying the name "Universal Nearest Neighbors". In
particular, Alg. 1 induces the leveraged k-NN classifier by
minimizing the exponential surrogate risk (6), very much like
regular boosting does for inducing a weighted voting rule for a
set of weak classifiers.

The key observation when training weak classifiers with UNN is
that, at each iteration, one single example (indexed by j) is
considered as a prototype to be leveraged. Indeed, all the
other training data are to be viewed as observations for which
j may possibly vote. In particular, due to k-NN voting, j can
be a classifier only for its reciprocal nearest neighbors
(i.e., those data for which j itself is a neighbor,
corresponding to non-zero entries in matrix (10) on column j).
This brings a remarkable simplification when computing
$\delta_j$ in step [I.1] and updating weights $w_i$ in step
[I.2] (Eqs. 16 and 17). Indeed, only weights of reciprocal
nearest neighbors of j are involved in these computations, thus
allowing us not to store the entire matrix $R^{(c)}$,
c = 1, 2, ..., C. Note that the set of Rk-NN is split in two
subsets, containing the examples that agree (resp. disagree)
with the class membership of j, thus yielding the partial sums
$w_j^+$ and $w_j^-$ of (15).
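
A possible way to exploit this simplification is to store, for each index j, only its reciprocal neighbors together with their edges. The Python sketch below does so for one class and then computes the two partial sums. The dictionary-based layout and the names `reciprocal_columns` and `partial_sums` are assumptions of this illustration, and the neighbor lists are taken as given.

```python
def reciprocal_columns(neighbor_lists, y_c):
    """Sparse columns of the edge matrix for one class.

    neighbor_lists[i] holds the indices of the k nearest neighbors of example i;
    y_c[i] is the class-vector component of example i for the class at hand.
    Returns cols, where cols[j] maps each reciprocal neighbor i of j to r_ij.
    """
    cols = [dict() for _ in range(len(neighbor_lists))]
    for i, nn in enumerate(neighbor_lists):
        for j in nn:
            cols[j][i] = y_c[i] * y_c[j]
    return cols


def partial_sums(col_j, w):
    """Weights of the reciprocal neighbors that agree (w_plus) or disagree (w_minus)."""
    w_plus = sum(w[i] for i, r in col_j.items() if r > 0)
    w_minus = sum(w[i] for i, r in col_j.items() if r < 0)
    return w_plus, w_minus


# Hypothetical 1-NN lists for four examples and made-up weights.
neighbor_lists = [[1], [2], [1], [2]]
y_c = [1.0, 1.0, -1.0, -1.0]
w = [0.5, 1.0, 2.0, 4.0]
cols = reciprocal_columns(neighbor_lists, y_c)
print(cols[1], partial_sums(cols[1], w))
print(cols[2], partial_sums(cols[2], w))
```

Memory then grows with the number of neighbor relations (km entries per class) rather than with the m x m size of the full matrix.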

Note that when either $w_j^+$ or $w_j^-$ is zero, $\delta_j$ in
(16) is not finite. There is however a simple alternative,
inspired by Schapire and Singer (1999), which consists in
smoothing out $\delta_j$ when necessary, thus guaranteeing its
finiteness without impairing convergence. More precisely, we
suggest to replace:

$$w_j^+ \leftarrow w_j^+ + \frac{1}{m} , \qquad (13)$$

$$w_j^- \leftarrow w_j^- + \frac{1}{m} . \qquad (14)$$

Also note that step [I.0] relies on oracle WIC(., ., .) for
selecting index j of the next weak classifier. We propose two
alternative implementations of this oracle, as follows:

[I.0.a] a lazy approach: we set T = m and let j be chosen by
WIC({1, 2, ..., m}, t, c) either: (1) randomly, or (2)
following the alphabetic order of classes;

[I.0.b] the boosting approach: we pick T > m, and let j be
chosen by WIC({1, 2, ..., m}, t, c) such that $\delta_j$ is
large enough. Each j can be chosen more than once.

There are also schemes mixing [I.0.a] and [I.0.b]: for example,
we may pick T = m, choose j as in [I.0.b], but exactly once as
in [I.0.a].

Algorithm 1: Universal Nearest Neighbors UNN(S) for $\psi = \psi^{\mathrm{exp}}$
Input: $S = \{(o_i, y_i), i = 1, 2, ..., m, \ o_i \in O, \ y_i \in \mathbb{R}^C\}$
Let $r^{(c)}_{ij} = y_{ic} y_{jc}$ if $j \sim_k i$, 0 otherwise,
    $\forall i, j = 1, 2, ..., m$, c = 1, 2, ..., C;
for c = 1, 2, ..., C do
    Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, ..., m$;
    Let $w_i \leftarrow 1$, $\forall i = 1, 2, ..., m$;
    for t = 1, 2, ..., T do
        [I.0] Weak index chooser oracle: Let $j \leftarrow$ WIC({1, 2, ..., m}, t);
        [I.1] Let
            $w_j^+ = \sum_{i : r^{(c)}_{ij} > 0} w_i$ ,  $w_j^- = \sum_{i : r^{(c)}_{ij} < 0} w_i$ ;   (15)
            $\delta_j \leftarrow$ [...] ;   (16)
        [I.2] Let $w_i \leftarrow w_i \exp(-\delta_j r^{(c)}_{ij})$, $\forall i : j \sim_k i$;   (17)
        [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$
Output: $h^{\ell}_c(o_{i'}) = \sum_{i \sim_k i'} \alpha_{ic} y_{ic}$, $\forall c = 1, 2, ..., C$

2.5 Properties of UNN

In this section, we state two fundamental theorems for UNN. The
first theorem reports a general monotonic convergence property
of UNN to the optimal loss, for any given surrogate function.
The second theorem further refines this general convergence
theorem by providing an effective convergence bound for the
exponential loss.

Theorem 1 As the number of iteration steps T increases, UNN
converges to realizing the global minimum of the surrogate risk
at hand for any $\psi$ meeting conditions (i), (ii) and (iii)
above. (proof sketch in Appendix)

Although we prove the boosting ability of UNN for all
applicable surrogate losses, we choose to show in particular
its behavior for the exponential loss $\psi^{\mathrm{exp}}$,
which features a far better convergence bound than the general
one (Nock and Nielsen 2009).

Computing this bound is based on defining a weak index
assumption (WIA), which is to nearest neighbors what the
conventional weak learning assumption is to general induced
classifiers (Schapire and Singer 1999):

(WIA) let $p_j^{(c)} = w_j^{(c)+} / (w_j^{(c)+} + w_j^{(c)-})$.
There exist some $\gamma > 0$ and $\eta > 0$ such that the
following two inequalities hold for index j returned by
WIC(., ., .):

$$\left| p_j^{(c)} - 1/2 \right| \geq \gamma , \qquad (18)$$

$$\left( w_j^{(c)+} + w_j^{(c)-} \right) / \|w\|_1 \geq \eta . \qquad (19)$$

Theorem 2 If the WIA holds for $\tau \leq T$ steps in UNN (for
each c), then
$\varepsilon^{\mathrm{exp}}(h^{\ell}, S) \leq \exp(-2 \eta \gamma^2 \tau)$.
(proof sketch in Appendix)

Inequality (18) is the usual weak learning assumption (Schapire
and Singer 1999), when considering examples as weak
classifiers. But a weak coverage assumption (19) is needed as
well, because insufficient coverage of the reciprocal neighbors
could easily wipe out even the surrogate risk reduction
potentially due to a large $\gamma$. In addition, even when
classes are significantly overlapping, choosing k not too small
is enough for the WIA to be met for a large number of boosting
rounds $\tau$, thus determining a potential sharp decrease of
$\varepsilon^{0/1}(h^{\ell}, S)$. This is important, as there
are at most m different weak classifiers available to
WIC(., ., .), even when each one may be chosen more than once
under the WIA. Last but not least, Theorem 2 also displays the
fact that classification (18) may be more important than
coverage (19).
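
The overall training loop can be illustrated by the following minimal Python sketch, which is restricted to two classes with labels +1 and -1 so that all non-zero edges equal +1 or -1. Several choices are assumptions of this sketch rather than statements about Alg. 1: the leveraging step is the classic half log-ratio of Schapire and Singer (1999), both partial sums are smoothed by 1/m, the oracle is a lazy one that visits indices in a fixed order, and a training example is excluded from its own neighborhood. All function names are hypothetical.

```python
import math


def neighbors_of(i, X, k):
    """Indices of the k nearest neighbors of training example i (itself excluded)."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def unn_train(X, y, k, T):
    """Two-class sketch of UNN for the exponential loss; y[i] is +1 or -1."""
    m = len(X)
    # cols[j] maps each reciprocal neighbor i of j to the edge r_ij = y_i * y_j.
    cols = [dict() for _ in range(m)]
    for i in range(m):
        for j in neighbors_of(i, X, k):
            cols[j][i] = y[i] * y[j]
    alpha = [0.0] * m
    w = [1.0] * m
    risks = []
    for t in range(T):
        j = t % m  # lazy oracle: indices visited in a fixed order (assumption)
        # Smoothed partial sums over the reciprocal neighbors of j.
        w_plus = sum(w[i] for i, r in cols[j].items() if r > 0) + 1.0 / m
        w_minus = sum(w[i] for i, r in cols[j].items() if r < 0) + 1.0 / m
        delta = 0.5 * math.log(w_plus / w_minus)  # assumed two-class leveraging step
        for i, r in cols[j].items():
            w[i] *= math.exp(-delta * r)
        alpha[j] += delta
        # Since w_i = exp(-edge_i), the mean weight is the exponential risk.
        risks.append(sum(w) / m)
    return alpha, risks


def unn_predict(query, X, y, alpha, k):
    """Sign of the leveraged vote of the k nearest prototypes."""
    order = sorted(range(len(X)), key=lambda i: math.dist(query, X[i]))
    score = sum(alpha[i] * y[i] for i in order[:k])
    return 1 if score >= 0 else -1


X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
alpha, risks = unn_train(X, y, k=2, T=len(X))
print(risks)
print(unn_predict((0.1, 0.2), X, y, alpha, k=2), unn_predict((5.9, 5.8), X, y, alpha, k=2))
```

On this well-separated toy set every reciprocal neighbor agrees with its prototype, so all leveraging coefficients are positive and the recorded exponential risk never increases from one iteration to the next.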



3 Experiments 

In this section, we present experimental results of UNN vs
plain k-NN on both synthetic and real datasets. Such
experiments allowed us to quantify the gains brought by
boosting on nearest neighbors voting (Marin et al 2009). For
this purpose, we first performed tests on two-class synthetic
data to analyze the performances of UNN in detail (Sec. 3.1).
In Sec. 3.2 we discuss the data reduction ability of our
technique. Then, we carried out experiments of multi-class
scene categorization on a dataset of natural images and
compared the results of UNN to plain k-NN classification
(Sec. 3.3).



3.1 Synthetic datasets 

We have drilled down into the experimental behavior of UNN 
using the synthetic Ripley's dataset ( [Ripley [ pi 994| ) with two 
classes denoted by P and N. Each population of this dataset 
is an equal mixture of two two-dimensional normally dis- 
tributed populations, which are equally likely. Training and 
test dataset (consisting of 250 and 1000 points, respectively) 
are shown in Figure[2j where the optimal classification bound- 
ary of the Bayes rule is also displayed. This corresponds to 
the best theoretical error rate of 8.0% ( |Ripley[|l994| ). 

Fig. |3] validates on this dataset the monotonous decay 
of the exponential risk ([6]), mathematically proved in Theo- 
rem[2]under the two basic weak index/learning assumptions. 
It also shows the effect of three different implementations of 
the Wic oracle (Sec. |2.5[ ). Note that the boosting approach 
for selecting weak classifiers provides much faster decay of 
the surrogate risk, thus outperforming the two tested "lazy" 
implementations. In these latter cases, the index j of the 
weak classifier at each UNN iteration was chosen either ran- 
domly or following the order of examples in their respective 
categories. 

Classification results for a range of values of k are shown in
Fig. 4. They allow us to draw two main conclusions. First, test
errors show that UNN is robust to variations of k. Second, filtering
out even a large proportion $1 - \theta$ of examples with the
smallest $\|\alpha_j\|_2$ does not degrade classification
performance, and can even significantly improve it. As witnessed by
Fig. 4, values as small as $\theta = 0.25$ yield improvements that
bring the test error close to the Bayes error (e.g., see the minimum
error of boosted k-NN for $\theta = 0.25$, $k = 9$). We investigate
this data reduction ability of UNN in the following section.
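
The filtering step discussed above can be sketched as follows. This is a minimal illustration, assuming the leveraging coefficients have already been learned and are stored as one vector per prototype; the helper name `filter_prototypes` and the toy coefficients are hypothetical and do not come from the experiments.

```python
import math

def filter_prototypes(alphas, theta):
    """Return indices of the proportion `theta` of prototypes with largest ||alpha_j||_2.

    alphas: list of per-prototype coefficient vectors (one entry per class).
    theta:  proportion of prototypes to retain, 0 < theta <= 1.
    """
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    norms = [math.sqrt(sum(a * a for a in alpha)) for alpha in alphas]
    n_keep = max(1, math.ceil(theta * len(alphas)))
    # Rank prototype indices by decreasing norm and keep the first n_keep.
    ranked = sorted(range(len(alphas)), key=lambda j: norms[j], reverse=True)
    return sorted(ranked[:n_keep])

# Toy coefficients (hypothetical): prototypes 1 and 3 carry almost no leverage.
alphas = [[0.9], [0.05], [-1.2], [0.01]]
kept = filter_prototypes(alphas, theta=0.5)
```

With these toy values, the two prototypes with the largest coefficient magnitude are retained, whatever the sign of their coefficient.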








Fig. 2 Training and validation data for Ripley's dataset. The Bayes
boundary is also drawn as reported in (Ripley 1994).



[Figure 3 plot. Legend: WIC = boosting; WIC = random order; WIC = alphabetic order.]



Fig. 3 Decrease of $\varepsilon^{\exp}(h, S)$ as a function of $T$ in
UNN for Ripley's dataset for different oracle implementations. Note
that the boosting implementation ([I.0.b], Sec. 2.4) always guarantees
monotonic decrease of the surrogate loss, as long as the weak
assumptions are met (red curve). Conversely, the lazy implementation
([I.0.a], Sec. 2.4) may select, at a given step, a classifier that
does not match those assumptions, thus preventing the loss from
strictly decreasing (see green and blue curves).



[Figure 4 plot. Legend: k-NN; UNN $\theta = 1$; UNN $\theta = 0.75$; UNN $\theta = 0.5$; UNN $\theta = 0.25$.]




Fig. 4 Test error for UNN as a function of k for boosted k-NN. Bayes 
rule yields 8% optimal misclassification rate. 

3.2 Filtering the prototype dataset

Experiments on the synthetic data illustrate the significant
precision improvement provided by filtering the prototype dataset.
Under standard sampling assumptions (Schapire et al 1998), filtering
benefits from two positive effects. The first is a margin effect,
well known for induced classifiers (Schapire et al 1998). The
goodness-of-fit of the k-NN rule is driven by the most accurate
examples, i.e. those surrounded by examples of the same class, which
get the largest $\|\alpha_j\|_2$. The least accurate ones, e.g.
those located in overlapping regions between two classes, get the
smallest. Discarding these latter examples tends to increase the gap
between class clouds, but each cloud may shelter examples of
different classes. Fortunately, filtering with boosting is
accompanied by a subtle local repolarization of predictions which,
as explained in Figure 5 for $\theta = 0.25$, makes this gap
maximization translate to margin maximization, for which positive
effects on learning are known (Schapire et al 1998). The second
effect is structural: in nearest neighbors rules, the frontier
between classes stems from the Voronoi cells of those least accurate
examples. Discarding them separates the classes better, as witnessed
by Fig. 5. Above all, it reduces the number of Voronoi cells
involved in the class frontiers, thus reducing structural parameters
(VC-dimension) of the classifier, possibly buying a reduction of the
test error as well (Schapire et al 1998).



3.3 Image Categorization 

We tested our k-NN boosting algorithm for image categorization. In
particular, we used the global Gist descriptor of Oliva and Torralba
(2001) in order to obtain a meaningful representation of images.
This descriptor provides a global representation of a scene, while
not requiring explicit segmentation of image regions and objects. In
the typical setting, an image is represented by a single vector of
dimension 512, which collects features related to the spatial
organization of dominant scales and orientations in the image. This
one-to-one correspondence between images and descriptors is one of
the main advantages of using global descriptors over representations
based on bags of local features (Grauman and Darrell 2005). Indeed,
global descriptors are straightforwardly adapted to image
categorization methods relying on machine learning techniques, as
most of these techniques, from prototype-based to kernel-based,
require any instance of a particular category to be represented by a
single vector. In particular, this is the case of k-NN
classification, which explicitly relies on measuring one-to-one
similarity between a query image and prototype images. In addition,
Gist descriptors have proven successful in representing relevant
contextual information of natural scenes, which makes it possible to
compute meaningful priors for exploration tasks like object
detection and localization (Rubin et al 2003).

The dataset we used contains 2688 color images of outdoor scenes of
size 256x256 pixels, divided into 8 categories: coast, mountain,
forest, open country, street, inside city, tall buildings and
highways. One example image of each category is shown in Fig. 6.

To extract global descriptors from these images we used the MATLAB
implementation by Torralba², with the most common settings: 4
resolution levels of the Gabor pyramid, 8 orientations per scale and
4x4 blocks.
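
These settings account for the descriptor dimension of 512 quoted earlier. The short sketch below makes the arithmetic explicit, assuming the usual Gist layout of one averaged filter response per filter and per spatial block; the function name `gist_dimension` is hypothetical.

```python
def gist_dimension(n_scales, n_orientations, grid_size):
    """Number of Gist features: one averaged response per filter and per block."""
    n_filters = n_scales * n_orientations   # 4 scales x 8 orientations
    n_blocks = grid_size * grid_size        # 4 x 4 spatial blocks
    return n_filters * n_blocks

dim = gist_dimension(n_scales=4, n_orientations=8, grid_size=4)
print(dim)
```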

We used this database to validate UNN for different values of k. In
particular, we concentrated on evaluating classification performance
when filtering the prototype dataset, i.e. retaining a proportion
$\theta$ of the most relevant examples as prototypes for
classification. Such a data reduction capability is one of the most
interesting properties of UNN, as it favourably impacts the
computational cost of classification, which grows at least
logarithmically (at most linearly) with the dataset size. Indeed,
classification roughly amounts to searching for the k nearest
neighbors among prototypes, which is $O(kd\theta m)$ for linear
exhaustive search and $O(kd\log(\theta m))$ for fast kD-tree based
search (Arya et al 1998) ($d$ being the dimension of feature
vectors, $\theta$ the proportion of retained classifiers).
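
The two complexity estimates can be compared numerically. The sketch below only evaluates the expressions above with all constants ignored, so it gives orders of magnitude rather than timings; `search_cost` is a hypothetical helper, and the values m = 896, d = 512 and k = 11 are borrowed from the experimental setting purely for illustration.

```python
import math

def search_cost(k, d, m, theta, method="exhaustive"):
    """Operation-count estimate for k-NN search over theta * m prototypes."""
    n = theta * m
    if method == "exhaustive":
        return k * d * n              # linear in the number of prototypes
    if method == "kdtree":
        return k * d * math.log(n)    # logarithmic in the number of prototypes
    raise ValueError("unknown method: " + method)

m, d, k = 896, 512, 11
full = search_cost(k, d, m, theta=1.0)
quarter = search_cost(k, d, m, theta=0.25)
ratio = quarter / full   # relative cost of exhaustive search after filtering
```

Under the linear model the cost scales directly with $\theta$, whereas the kD-tree estimate decreases much more slowly as prototypes are removed.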

Fig. 7 shows results of 3-fold cross-validation in terms of the mean
Average Precision (mAP)³ as a function of $\theta$, for different
values of k. We randomly split the database into 3 distinct subsets,
each containing 896 images. Then, for each fold, we used one of
these subsets as the training set, while validating on the two
remaining subsets. In each experiment, UNN was run over the training
set and a subset of the trained weak classifiers was retained as
prototypes for classifying the test images. In particular, we
selected all training images j with leveraging coefficients
$\alpha_{jc}$, $c = 1, 2, ..., C$, such that
$\alpha_{jc} > \alpha > 0$. Note from Fig. 7 that, even when fixing
threshold $\alpha$ so as to retain all the examples, the actual
proportion $\theta$ of prototypes is less than one, because UNN
always discards the examples with null leveraging coefficients,
which do not match assumptions (18, 19).
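
The cross-validation protocol above differs from the common one in that each fold trains on a single subset and validates on the other two. A minimal sketch of such a split is given below; the function name `three_fold_splits` and the fixed seed are assumptions made for the illustration.

```python
import random

def three_fold_splits(n_images, n_folds=3, seed=0):
    """Yield (train, validation) index lists, training on ONE subset per fold."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    fold_size = n_images // n_folds
    subsets = [indices[f * fold_size:(f + 1) * fold_size] for f in range(n_folds)]
    for f in range(n_folds):
        train = subsets[f]
        # Validation set: the union of the remaining subsets.
        validation = [i for g in range(n_folds) if g != f for i in subsets[g]]
        yield train, validation

folds = list(three_fold_splits(2688))
```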

We compared UNN with classic k-NN classification. Namely, in order
for the classification cost of k-NN to be roughly the same as that
of UNN, we randomly sampled the prototype dataset to select a
proportion $\theta$ (between 10% and the whole set of examples). UNN
significantly outperforms classic k-NN, increasingly so with k, as
shown in Fig. 8(a).



2 Publicly available at http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m

3 The mAP was computed by averaging classification rates over
categories (diagonal of the confusion matrix) and then averaging
those values over the 3 cross-validation folds (Oliva and Torralba
2001).
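
The mAP computation described in this footnote can be sketched as follows, assuming each confusion matrix stores counts with one row per true category; the function name and the two toy matrices are hypothetical.

```python
def mean_average_precision(confusions):
    """Average the per-category classification rates, then average over folds.

    confusions: one square count matrix per fold, rows indexed by true category.
    """
    fold_scores = []
    for conf in confusions:
        # Classification rate of category c: diagonal entry over its row total.
        per_class = [row[c] / sum(row) for c, row in enumerate(conf)]
        fold_scores.append(sum(per_class) / len(per_class))
    return sum(fold_scores) / len(fold_scores)

# Hypothetical two-category confusion matrices for two folds.
confusions = [
    [[8, 2], [1, 9]],
    [[6, 4], [2, 8]],
]
score = mean_average_precision(confusions)
```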

Fig. 5 Maps of positive/negative leveraging coefficients $\alpha_j$ over training data for $k = 3$ and three different values of $\theta$. Examples of class N with
negative $\alpha_j$ (filled squares) and those of class P with positive $\alpha_j$ (empty circles) predict class P; similarly, empty squares and filled circles both
correspond to membership prediction in N. For this reason, when $\theta = 0.25$, filtering produces a clear-cut gap between the two possible membership
predictions (but not between the original classes). The optimal Bayes boundary between classes is shown as well. Interestingly, while this frontier
still does not separate the original classes (without error), it does separate the membership predictions, with much larger minimal margin. The
combination of the data reduction and polarity reversal for memberships has thus simplified the learning of S, and eased the capture of the optimal
frontier with nearest neighbors.




[Figure 6 panels: coast, forest, highway, inside city, mountain, open country, street, tall buildings.]

Fig. 6 Examples of annotated images of the database containing 2688 images classified into 8 categories. 



Image categorization results confirm the trend observed on the
synthetic data when filtering the prototype dataset. In particular,
selecting a reduced set of prototypes limits overfitting on training
data, while improving classification performance on the test set
(typically 3% improvement). Most interestingly, the classification
precision of UNN is very stable as a function of $\theta$, as shown
in Fig. 8(b), where the drop of UNN precision for the largest values
of $\theta$ is due to including prototypes with negative leveraging
coefficients as well. To summarize, UNN displays the ability to
discriminate the most relevant images of each class, thus inducing a
classification rule robust to "noisy" prototypes arising from low
inter-class variations. Adjusting the value of threshold $\alpha$
makes it possible to remove those confusing prototypes, thus
reducing the representation of each category to a sparse subset of
meaningful prototype images.



Fig. 7 Classification performances of UNN compared to k-NN in 3-fold cross-validation.

Fig. 8 Performances of k-NN and UNN classification as a function of (a) k and (b) $\theta$. (The best results obtained with each of the two methods are
plotted.)

Fig. 10 shows two examples of how the leveraged k-NN rule may
correct misclassifications due to uniform k-NN voting. For instance,
in the first example, the classic and the boosted k-NN methods are
compared when classifying an image belonging to class coast, with
k = 11. The leveraged rule with as few as 20% of prototype images is
able to correctly label the query image (first row). Below each
nearest neighbor image we show its contribution to the classifier of
(9): note that negative votes are significantly smaller than
positive ones (up to an order of magnitude), thus determining
positive labeling with high prediction score $h_c$, according to
(9). On the contrary, the uniform voting rule with all prototypes
misclassifies the test image, not being able to reject contributions
by "noisy" neighbor images. An example of prototypes selected by
filtering the dataset is shown in Fig. 11, where the leveraging
coefficients refer to the first category (tall buildings) versus the
remaining ones.
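
The difference between the two voting rules can be illustrated on a toy neighborhood. The sketch below assumes, for simplicity, class memberships $y_{jc}$ in {-1, +1} and made-up leveraging coefficients; both function names are hypothetical, and the numbers are not those of Fig. 10.

```python
def leveraged_score(neighbors):
    """Score for one class: sum of alpha_jc * y_jc over the k nearest prototypes."""
    return sum(alpha * y for alpha, y in neighbors)

def uniform_score(neighbors):
    """Classic k-NN vote for the same class: every neighbor counts equally."""
    return sum(y for _, y in neighbors)

# Hypothetical neighbors of a query, as (alpha_jc, y_jc) pairs: two strongly
# leveraged prototypes of the class and three weakly leveraged ones outside it.
neighbors = [(1.5, +1), (1.1, +1), (0.2, -1), (0.1, -1), (0.3, -1)]

leveraged = leveraged_score(neighbors)   # positive: query assigned to the class
uniform = uniform_score(neighbors)       # negative: majority vote rejects the class
```

In this toy case the majority of neighbors lies outside the class, so uniform voting rejects it, while the leveraged rule accepts it because the dissenting prototypes carry little weight.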

4 Conclusion

In this paper, we contribute to filling an important gap in NN
methods, showing how boosting can be transferred to k-NN
classification. Namely, we propose a novel boosting algorithm, UNN
(Universal Nearest Neighbors rule), for inducing a leveraged k-NN
rule. This rule generalizes classic k-NN to weighted voting where
weights, the so-called leveraging coefficients, are iteratively
learned by UNN. We prove that this algorithm converges to the global
optimum of surrogate risks under very mild assumptions.

Experiments on both synthetic and image categorization databases
show that UNN provides significant performance improvements (up to
the best possible performance of the Bayes rule). Moreover, UNN
exhibits consistent data reduction ability, which results in
significant speed-ups for classification (up to a factor 16 when
removing 3/4 of the coefficients).

Our approach is built on top of k-NN search, thus being fully
compatible with existing techniques relying on metric distance
learning (Zhang et al 2006) as well as subspace projections like PCA
(Jain 2008) or kernel transformations of the input space, which are
expected to enable significant improvements of categorization
performance.

5 Appendix

Generic UNN algorithm The general version of UNN is shown in Alg. 2.
This algorithm induces the leveraged k-NN rule (9) for the broad
class of surrogate losses meeting conditions of Bartlett et al
(2006), thus generalizing Alg. 1. Namely, we constrain $\psi$ to
meet the following conditions: (i) $\mathrm{im}(\psi) = \mathbb{R}_+$,
(ii) $\nabla_\psi(0) < 0$ ($\nabla_\psi$ is the conventional
derivative of the loss function $\psi$), and (iii) $\psi$ is
strictly convex and differentiable. (i) and (ii) imply that $\psi$
is classification-calibrated: its local minimization is roughly tied
up to that of the empirical risk (Bartlett et al 2006). (iii)
implies convenient algorithmic properties for the minimization of
the surrogate risk (Nock and Nielsen 2009). Three common examples
have been shown in Eq. (6)-(5).
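
Conditions (ii) and (iii) are easy to check numerically for the three losses. The sketch below is only an illustration: the exponential and logistic losses follow Table 1, while the form $(1 - x)^2$ used for the squared loss is an assumption, and the finite-difference derivative and midpoint-convexity test are simple stand-ins for an analytic argument.

```python
import math

LOSSES = {
    "exp": lambda x: math.exp(-x),
    "log": lambda x: math.log(1.0 + math.exp(-x)),
    "squ": lambda x: (1.0 - x) ** 2,   # assumed form of the squared loss
}

def derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def is_midpoint_convex(f, xs):
    """Check f((a+b)/2) <= (f(a)+f(b))/2 on all pairs of sample points."""
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + 1e-12 for a in xs for b in xs)

grid = [i / 4 for i in range(-12, 13)]
report = {
    name: (derivative(f, 0.0), is_midpoint_convex(f, grid))
    for name, f in LOSSES.items()
}
```

Each entry of `report` holds the slope at zero, which condition (ii) requires to be negative, and the outcome of the convexity check on the sampled grid.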



The main bottleneck of UNN is step [I.1], as Eq. (21) is non-linear,
but it always has a solution, finite under mild assumptions (Nock
and Nielsen 2009): in our case, $\delta_j$ is guaranteed to be
finite when there is no total matching or mismatching of example
j's memberships with its reciprocal neighbors', for the class at
hand. The second column of Table 1 contains the solutions to (21)
for surrogate losses mentioned in Sec. 2.2. Those solutions are
always exact for the exponential loss ($\psi^{\exp}$) and squared
loss ($\psi^{\mathrm{squ}}$); for the logistic loss ($\psi^{\log}$)
the solution is exact when the weights in the reciprocal
neighborhood of j are the same, otherwise it is approximated. Since
starting weights are all the same, exactness can be guaranteed
during a large number of inner rounds depending on which order is
used to choose the examples. Table 1 helps to formalize the
finiteness condition on $\delta_j$ mentioned above: when either sum
of weights in (20) is zero, the solutions in the first and third
line of Table 1 are not finite. A simple strategy to cope with
numerical problems arising from such situations is that proposed by
Schapire and Singer (1999). (See Sec. 2.4.) Table 1 also shows how
the weight update rule (22) specializes for the mentioned losses.
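
For the exponential loss, the inner loop can be sketched in a few lines. The code below is a simplified illustration for a single class under stated assumptions, not the reference implementation: labels are taken in {-1, +1}, $r_{ij} = y_i y_j$ when j is among the k nearest neighbors of i, the oracle greedily picks the prototype giving the largest risk decrease, and the closed form $\delta_j = \frac{1}{2}\log(w_j^+ / w_j^-)$ is smoothed with a small constant `eps` in the spirit of Schapire and Singer (1999). All function names and the toy data are hypothetical.

```python
import math
import random

def reciprocal_edges(points, labels, k):
    """edges[j] lists (i, r_ij) with r_ij = y_i * y_j for every i having j among its k-NN."""
    m = len(points)
    edges = [[] for _ in range(m)]
    for i in range(m):
        others = sorted((math.dist(points[i], points[j]), j) for j in range(m) if j != i)
        for _, j in others[:k]:
            edges[j].append((i, labels[i] * labels[j]))
    return edges

def unn_exp(points, labels, k, n_rounds, eps=1e-8):
    """Leveraging coefficients for one class with the exponential loss."""
    m = len(points)
    edges = reciprocal_edges(points, labels, k)
    weights = [1.0] * m              # equal starting weights
    alpha = [0.0] * m
    risks = []
    for _ in range(n_rounds):
        # Assumed oracle: prototype whose exact update decreases the risk the most.
        best_j, best_gain, best_sums = 0, -1.0, (0.0, 0.0)
        for j in range(m):
            w_plus = sum(weights[i] for i, r in edges[j] if r > 0)
            w_minus = sum(weights[i] for i, r in edges[j] if r < 0)
            gain = (math.sqrt(w_plus) - math.sqrt(w_minus)) ** 2
            if gain > best_gain:
                best_j, best_gain, best_sums = j, gain, (w_plus, w_minus)
        w_plus, w_minus = best_sums
        # Smoothed closed form: delta stays finite when one of the sums is zero.
        delta = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
        for i, r in edges[best_j]:
            weights[i] *= math.exp(-delta * r)
        alpha[best_j] += delta
        risks.append(sum(weights) / m)   # exponential risk: w_i = exp(-edge of example i)
    return alpha, risks

# Toy two-class data (hypothetical): two Gaussian clouds of 30 points each.
rng = random.Random(0)
points = [(rng.gauss(c, 0.6), rng.gauss(c, 0.6)) for c in (-1, 1) for _ in range(30)]
labels = [c for c in (-1, 1) for _ in range(30)]
alpha, risks = unn_exp(points, labels, k=5, n_rounds=40)
```

Because the smoothed step always lies between zero and the exact minimizer, the recorded exponential risk never increases from one round to the next.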



Proof sketch of Theorem 1 We show that UNN converges to
the global optimum of any surrogate risk (Sec. 2.2). So, let
us consider the surrogate risk (4) for any fixed class c =



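Algorithm 2: Algorithm Universal Nearest Neighbors UNN($S$, $\psi$)

Input: $S = \{(o_i, y_i), i = 1, 2, ..., m, o_i \in O, y_i \in \{-\frac{1}{C-1}, 1\}^C\}$, $\psi$ meeting (i), (ii), (iii) (Sec. 5);
Let $r_{ij}^{(c)} = y_{ic} y_{jc}$ if $j \sim_k i$, $0$ otherwise, $\forall i, j = 1, 2, ..., m$, $c = 1, 2, ..., C$;
for $c = 1, 2, ..., C$ do
    Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, ..., m$;
    Let $w_i \leftarrow -\nabla_\psi(0) \in \mathbb{R}_{+*}$, $\forall i = 1, 2, ..., m$;
    for $t = 1, 2, ..., T$ do
        [I.0] Let $j \leftarrow$ WIC($\{1, 2, ..., m\}$, $t$);
        [I.1] Let
              $w_j^{(c)+} = \sum_{i : r_{ij}^{(c)} > 0} w_i$, $w_j^{(c)-} = \sum_{i : r_{ij}^{(c)} < 0} w_i$;    (20)
              Let $\delta_j \in \mathbb{R}$ solution of:
              $\sum_{i=1}^{m} r_{ij}^{(c)} \nabla_\psi\left(\delta_j r_{ij}^{(c)} + \nabla_\psi^{-1}(-w_i)\right) = 0$;    (21)
        [I.2] $\forall i : j \sim_k i$, let
              $w_i \leftarrow -\nabla_\psi\left(\delta_j r_{ij}^{(c)} + \nabla_\psi^{-1}(-w_i)\right)$;    (22)
        [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h_c(o_{i'}) = \sum_{i \sim_k i'} \alpha_{ic} y_{ic}$, $\forall c = 1, 2, ..., C$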


Fig. 9 A geometric view of how UNN converges to the global optimum
of (4). (See Appendix for details and notations.)



1, 2, ..., C:

$$\varepsilon_c^{\psi}(h, S) = \frac{1}{m} \sum_{i=1}^{m} \psi(\varrho(h, i, c)) . \qquad (23)$$



Let $w_t$ denote the $t$-th weight vector inside the "for c" loop
of Alg. 2 (assuming $w_0$ is the initialization of $w$); similarly,
$h_t$ denotes the $t$-th leveraged k-NN rule obtained after the
update in [I.3]. The following identity holds, whose proof follows
from Nock and Nielsen (2009):

$$\psi(\varrho(h_t, i, c)) = g + D_{\tilde{\psi}}(0 \| w_{ti}) , \qquad (24)$$
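
Table 1 Three common loss functions and the corresponding solutions $\delta_j$ of (21) and $w_i$ of (22). (Vector $r_j^{(c)}$ designates column $j$ of $R^{(c)}$ and
$\|.\|_1$ is the $L_1$ norm.) The rightmost column says whether it is (A)lways the solution, or whether it is when the weights of reciprocal neighbors of
$j$ are the (S)ame.

loss function | $\delta_j$ in (21) | $w_i$ in (22) | Opt
$\psi^{\exp} = \exp(-x)$ | $\frac{1}{2} \log \frac{w_j^{(c)+}}{w_j^{(c)-}}$ | $w_i \exp(-\delta_j r_{ij}^{(c)})$ | A
$\psi^{\mathrm{squ}} = (1 - x)^2$ | $\frac{w_j^{(c)+} - w_j^{(c)-}}{2 \|r_j^{(c)}\|_1}$ | $w_i - 2 \delta_j r_{ij}^{(c)}$ | A
$\psi^{\log} = \log(1 + \exp(-x))$ | $\log \frac{w_j^{(c)+}}{w_j^{(c)-}}$ | [illegible in the extracted source] | S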



where $g(m) = -\tilde{\psi}(0)$ does not depend on the k-NN rule.
Eq. (24) makes the connection between the real-valued classification
problem and a geometric problem in the non-metric space of weights.
Here, we have made use of the following notations:
$\tilde{\psi}(x) = \psi^{\star}(-x)$, where
$\psi^{\star}(x) = x \nabla_\psi^{-1}(x) - \psi(\nabla_\psi^{-1}(x))$
is the Legendre conjugate of $\psi$;
$D_{\tilde{\psi}}(w_i \| w_i') = \tilde{\psi}(w_i) - \tilde{\psi}(w_i') - (w_i - w_i') \nabla_{\tilde{\psi}}(w_i')$
is the Bregman divergence with generator $\tilde{\psi}$ (Nock and
Nielsen 2009). $\nabla_{\tilde{\psi}}$ is related to $\nabla_\psi$
in such a way that $\nabla_{\tilde{\psi}}(x) = -\nabla_\psi^{-1}(-x)$.
Eq. (24) proves handy as one computes the difference
$\varepsilon_c^{\psi}(h_{t+1}, S) - \varepsilon_c^{\psi}(h_t, S)$.
Indeed, using (24) in (23), and computing $\delta_j$ in (21) so as
to bring $h_{t+1}$ from $h_t$, we obtain:

$$\varepsilon_c^{\psi}(h_{t+1}, S) - \varepsilon_c^{\psi}(h_t, S) = -\frac{1}{m} \sum_{i=1}^{m} D_{\tilde{\psi}}(w_{(t+1)i} \| w_{ti}) . \qquad (25)$$

Since Bregman divergences are non-negative and meet the identity of
the indiscernibles, (25) implies that steps [I.1]-[I.3] guarantee
the decrease of (23) as long as $\delta_j \neq 0$. But (23) is
lower-bounded, hence UNN must converge. In addition, it converges to
the global optimum of (23). Since predictions for each class are
independent, the proof consists in showing that (23) converges to
its global minimum for each $c$. Assume this convergence for the
current class, $c$. Then, following Nock and Nielsen (2009), (21)
and (22) imply that, when any possible $\delta_j = 0$, the weight
vector, say $w_\infty$, satisfies $R^{(c)\top} w_\infty = 0$, i.e.,
$w_\infty \in \ker R^{(c)\top}$, and $w_\infty$ is unique. But the
kernel of $R^{(c)\top}$ and $\overline{W}$, the closure of $W$, are
provably Bregman orthogonal (Nock and Nielsen 2009), thus yielding:

$$\underbrace{\sum_{i=1}^{m} D_{\tilde{\psi}}(0 \| w_i)}_{m \varepsilon_c^{\psi}(h, S) - m g} = \underbrace{\sum_{i=1}^{m} D_{\tilde{\psi}}(0 \| w_{\infty i})}_{m \varepsilon_c^{\psi}(h_\infty, S) - m g} + \underbrace{\sum_{i=1}^{m} D_{\tilde{\psi}}(w_{\infty i} \| w_i)}_{\geq 0} , \quad \forall w \in \overline{W} . \qquad (26)$$

Underbraces use (24) in (23), and $h$ is a leveraged k-NN rule
corresponding to $w$. One obtains that $h_\infty$ achieves the
global minimum of (23), as claimed.

The proof sketch is graphically summarized in Figure 9. In
particular, two crucial Bregman orthogonalities are mentioned (Nock
and Nielsen 2009). The red one symbolizes:

$$\sum_{i=1}^{m} D_{\tilde{\psi}}(0 \| w_{ti}) = \sum_{i=1}^{m} D_{\tilde{\psi}}(0 \| w_{(t+1)i}) + \sum_{i=1}^{m} D_{\tilde{\psi}}(w_{(t+1)i} \| w_{ti}) , \qquad (27)$$

which is equivalent to (25). The black one on $w_\infty$ is (26).

Proof sketch of Theorem 2 Using developments analogous to those of
Nock and Nielsen (2009), UNN can be shown to be equivalent to
AdaBoost in which $m$ weak classifiers are available, each one being
an example. Each weak classifier returns a value in $\{-1, 0, 1\}$,
where $0$ is reserved for examples outside the reciprocal
neighborhood. Theorem 3 of Schapire and Singer (1999) brings in our
case:

$$\varepsilon^{0/1}(h, S) \leq \frac{1}{C} \sum_{c=1}^{C} \prod_{t=1}^{T} Z_t^{(c)} , \qquad (28)$$

where $Z_t^{(c)} = \sum_{i=1}^{m} \tilde{w}_{it}^{(c)}$ is the
normalizing coefficient for each weight vector in UNN.
($\tilde{w}_{it}^{(c)}$ denotes the weight of example $i$ at
iteration $(t, c)$ of UNN, and the tilde notation refers to weights
normalized to unity at each step.) It follows that:

$$Z_t^{(c)} = 1 - \tilde{w}_{jt}^{(c)} \left(1 - 2\sqrt{p_{jt}^{(c)} (1 - p_{jt}^{(c)})}\right) \leq \exp\left(-\tilde{w}_{jt}^{(c)} \left(1 - 2\sqrt{p_{jt}^{(c)} (1 - p_{jt}^{(c)})}\right)\right) \leq \exp\left(-\eta \left(1 - \sqrt{1 - 4\gamma^2}\right)\right) \leq \exp(-2 \eta \gamma^2) ,$$

where $\tilde{w}_{jt}^{(c)} = \tilde{w}_{jt}^{(c)+} + \tilde{w}_{jt}^{(c)-}$
and $p_{jt}^{(c)} = \tilde{w}_{jt}^{(c)+} / (\tilde{w}_{jt}^{(c)+} + \tilde{w}_{jt}^{(c)-})$.
The first inequality uses $1 - x \leq \exp(-x)$, and the second the
WIA. Since even when the WIA does not hold, we still observe
$Z_t^{(c)} \leq 1$, plugging the last inequality in (28) yields the
statement of the Theorem.
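
The first steps of this argument can be checked numerically. The sketch below assumes the AdaBoost-style update with abstaining weak classifiers, i.e. normalized weights that split into a positive part `w_plus`, a negative part `w_minus` and a remainder outside the reciprocal neighborhood; the function names and the test values are hypothetical.

```python
import math

def normalizer(w_plus, w_minus):
    """Total weight after one exponential-loss update, before renormalizing."""
    delta = 0.5 * math.log(w_plus / w_minus)
    w_zero = 1.0 - w_plus - w_minus    # examples outside the reciprocal neighborhood
    return w_zero + w_plus * math.exp(-delta) + w_minus * math.exp(delta)

def closed_form(w_plus, w_minus):
    """Same quantity written as 1 - w_j * (1 - 2 * sqrt(p * (1 - p)))."""
    w_j = w_plus + w_minus
    p = w_plus / w_j
    return 1.0 - w_j * (1.0 - 2.0 * math.sqrt(p * (1.0 - p)))

cases = [(0.30, 0.10), (0.05, 0.02), (0.45, 0.45), (0.60, 0.10)]
checks = []
for w_plus, w_minus in cases:
    z = normalizer(w_plus, w_minus)
    cf = closed_form(w_plus, w_minus)
    bound = math.exp(-(1.0 - cf))      # 1 - x <= exp(-x) with x = 1 - Z
    checks.append((z, cf, bound))
```

Each triple in `checks` holds the normalizer computed from the update, its closed form, and the exponential upper bound; the balanced case (0.45, 0.45) gives a normalizer of one, which is the situation where the update brings no decrease.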

Fig. 10 Two examples where UNN corrects misclassifications of k-NN. The query image is shown in the leftmost column. The 11-nearest prototype images are shown on the right: the first row
refers to UNN with 20% of retained prototypes ($\theta = 0.2$), whereas the second row refers to classic k-NN classification over all prototypes ($\theta = 1$). Neighbors in the same category as the query
image are surrounded by black boxes. Votes given by each prototype for the true category (coast) are shown below each image (such values correspond to $\alpha_{jc} y_{jc}$ in (9), where c is the ground-truth
category).



